Search for: All records

Creators/Authors contains: "Doshi, F R"


  1. Self-supervised Vision Transformers (ViTs) like DINOv2 show strong holistic shape processing capabilities, a feature linked to computations in their intermediate layers. However, the specific mechanism by which these layers transform local patch information into a global, configural percept remains a black box. To dissect this process, we conduct fine-grained mechanistic analyses by disentangling patch representations into their constituent content and positional information. We find that high-performing models demonstrate a distinct multi-stage processing signature: they first preserve the spatial localization of image content through many layers while concurrently refining their positional representations. Computationally, we show that this is supported by a systematic "local-global handoff," where attention heads gradually shift to aggregating information using long-range interactions. In contrast, models with poor configural ability lose content-specific spatial information early and lack this critical positional refinement stage. This positional refinement is further stabilized by register tokens, which mitigate a common artifact in ViTs: the repurposing of low-information patch tokens into high-norm "outliers" that store global information, causing them to lose their local positional grounding. By isolating these high-norm activations in register tokens, the model better preserves the visual grounding of each patch, which we show also leads to a direct improvement in holistic processing. Overall, our findings suggest that holistic vision in ViTs arises not just from long-range attention, but from a structured pipeline that carefully manages the interplay […]
    Free, publicly-accessible full text available December 2, 2026
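
The "local-global handoff" described in the abstract above can be illustrated with a short, self-contained sketch. The code below is not from the paper; it is a minimal example, assuming an attention map has already been extracted for one layer's patch tokens (CLS and register tokens stripped), of how one might compute each head's mean attention distance on the patch grid. Tracking this quantity across layers would show heads shifting from short-range to long-range aggregation. The function name, tensor shapes, and grid size are assumptions for the example, not the paper's actual code.

# Illustrative sketch (assumed analysis, not the paper's implementation):
# per-head mean attention distance over a square patch grid.
import torch

def mean_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """attn: (heads, num_patches, num_patches) attention weights over patch
    tokens only. Returns per-head mean attention distance in units of patches."""
    heads, n, _ = attn.shape
    assert n == grid_size * grid_size, "expects a square patch grid"

    # (x, y) coordinate of every patch on the grid
    ys, xs = torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    )
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1).float()  # (n, 2)

    # pairwise Euclidean distance between query patch i and key patch j
    dist = torch.cdist(coords, coords)  # (n, n)

    # attention-weighted distance per query position, then mean over queries
    per_query = (attn * dist.unsqueeze(0)).sum(dim=-1)  # (heads, n)
    return per_query.mean(dim=-1)  # (heads,)

# Example with a random attention map for a 14x14 patch grid
# (e.g. a 224-pixel image with 16-pixel patches).
if __name__ == "__main__":
    heads, n = 12, 14 * 14
    attn = torch.softmax(torch.randn(heads, n, n), dim=-1)
    print(mean_attention_distance(attn, grid_size=14))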